Abstract



The large fields of convex optimization and active learning have developed fairly independently
of each other, from the design of algorithms to the techniques of proof. Given the growing literature in
both these subjects, we believe that understanding the connections between them is important to people
in both areas. Here, we establish a few such relationships in upper and lower bound techniques
that bring out these similarities. Our prime result is showing upper and lower bounds for precisely how
the minimax rate for optimizing a given function depends solely on a flatness/noise condition for the
function around its minimum.



1 Introduction 



Almost all convex optimization algorithms are, by design, sequential, with future steps depending
on the results of past actions. This gives them a flavour naturally found in active learning, which deals
with sequential sampling strategies with the aim of minimizing a loss function. One can naturally ask if
there is much in common between these two fields, given the similarity in their stated objectives. In
this paper, we answer this question strongly in the affirmative, by connecting concepts found in both.

Furthermore, we demonstrate algorithms and proof techniques from one field to solve problems in the other. 
The central problem we deal with is stochastic convex optimization, as introduced in the next section. 

We generalise the well-known concept of $k$-Uniform Convexity ($k$-UC) to a weaker notion of
Generalised $k$-Uniform Convexity ($k$-GUC) that only describes the flatness of a function at its optimum.
We prove lower bounds for how fast we can optimize such functions using methods from active learning, and
show that these bounds are indeed achieved by a recent variant of gradient descent. This work implies a
strong result: a convex function's behaviour around its minimum is the only factor that the minimax rates
for its optimization depend on, and the function estimation error looks like
$\tilde{\Theta}(T^{-\frac{k}{2k-2}})$ after $T$ steps. While UC only allows $k \geq 2$, GUC allows $k > 1$,
which yields rates much faster than $O(1/T)$ for functions with $1 < k < 2$.
We also prove that the point estimation error, which is not often considered by the optimization community,
looks like $\tilde{\Theta}(T^{-\frac{1}{2k-2}})$.

Our work bears some similarity to [6], where the authors use techniques for convex optimization analysis
to derive lower bounds for active learning in one dimension. However, they do not derive rates or give
algorithms for convex optimization, and our results are broader and richer. Another work that bears a
resemblance is [4], which derives upper bound rates for UC functions. Our upper bounds are achieved by
tuning the recent Epoch-Gradient-Descent for $k > 1$, whereas [4] analyze primal-dual subgradient methods
for $k \geq 2$. They also do not show lower bounds or connections to active learning, and they use UC (as
opposed to GUC). We get the same rates as they do, $O(T^{-\frac{k}{2k-2}})$, which vary smoothly between
$O(T^{-1})$ for strongly convex functions and $O(T^{-1/2})$ for convex functions. However, UC does not
permit $1 < k < 2$, but GUC does, and for these we can obtain rates that are faster than $O(T^{-1})$,
which is quite surprising and interesting.

To begin with, we define the oracle model for stochastic convex optimization from the seminal [5] and
the more recent [1], who proved tight lower bounds for convex and strongly convex classes. On introducing
Generalised $k$-Uniform Convexity, we point out its relationship to the Tsybakov Noise Condition (TNC)
that is popular in classification and level set estimation, and use this analogy to adapt an active learning
algorithm to perform optimal minimization in one dimension. We show how to use active learning proof
techniques from [2] to get minimax lower bounds for optimizing $k$-GUC functions in any dimension.
Finally, we get tight upper bounds by tuning the Epoch-GD algorithm [3] to achieve these rates in expectation
and with high probability for all $k > 1$.

2 Oracle Model of Stochastic Convex Optimization 

Stochastic convex optimization in the oracle model can be defined as the task of minimizing a $d$-dimensional
convex function $f$ over a convex set $S \subset \mathbb{R}^d$ when given oracle access to unbiased estimates
of the function value and gradient at any point in $S$, by using as few queries to the oracle as possible.
We follow the setup of [5] and [1], and summarise what is necessary for completeness.

A first order oracle is a (possibly random) function $O : S \to \mathbb{R}^{d+1}$, which answers a query $x$ by
returning $(\hat{f}(x), \hat{g}(x))$ such that $E[\hat{f}(x)] = f(x)$ and $E[\hat{g}(x)] = g(x)$, where
$g(x) \in \partial f(x)$. We additionally assume $\hat{f}$ and $\hat{g}$ have unit variance. Let the class of
all such oracles be called $\mathcal{O}$.
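As a minimal sketch of such an oracle (an illustration only: the helper name `make_oracle`, the test function $\sum_i |x_i|^{1.5}$ and the use of independent gaussian noise per coordinate are assumptions, not part of the setup above), one could write:

```python
import random


def make_oracle(f, grad, sigma=1.0, seed=0):
    """Build a noisy first-order oracle x -> (f_hat, g_hat).

    Both outputs are unbiased: zero-mean gaussian noise with standard
    deviation sigma is added to f(x) and to each gradient coordinate.
    """
    rng = random.Random(seed)

    def oracle(x):
        f_hat = f(x) + rng.gauss(0.0, sigma)
        g_hat = [gi + rng.gauss(0.0, sigma) for gi in grad(x)]
        return f_hat, g_hat

    return oracle


def f(x):
    # Hypothetical test function: sum_i |x_i|^1.5
    return sum(abs(xi) ** 1.5 for xi in x)


def grad(x):
    # Gradient of the test function, coordinate by coordinate
    return [1.5 * ((xi > 0) - (xi < 0)) * abs(xi) ** 0.5 for xi in x]


oracle = make_oracle(f, grad)
```

Averaging many responses at a fixed query point recovers $f(x)$ and $g(x)$, which is all that unbiasedness promises; a single response is dominated by noise.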

An optimization algorithm is any procedure that solves the task of finding the optimum $x^*_f$ by repeatedly
querying the oracle at different points in $S$. The method can decide which points to query based on the
results of earlier queries, and tries to use as few queries as possible to achieve its task. We normally assume
that it has no further knowledge of the functioning of the oracle. Define $\mathcal{M}_T$ to be the set of
methods that use $T$ queries and finally return an estimated point $x_T$.

The central question can be posed as follows: How many queries will it take to get $\epsilon$-close to the
optimal point? Or, equivalently: How close can we get to the optimal point, given a budget of $T$ queries
(time-steps)? We use the second framework in this paper.

The error bound achieved by an algorithm can be measured in two ways, which we call point-distance
and function-distance. For any $M_T \in \mathcal{M}_T$ that returns $x_T$, we define the function-error and
point-error with respect to function $f$ after $T$ queries in $S$ to oracle $O$ respectively as:

$\epsilon_T(M_T, f, S, O) = f(x_T) - \min_{x \in S} f(x) = f(x_T) - f(x^*_f)$

$\rho_T(M_T, f, S, O) = \mathrm{dist}(x_T, x^*_f) = \|x_T - x^*_f\|_2$

Given a class of functions $\mathcal{F}$, the minimax error is then defined as the expected error (over the
randomness of the oracle) achieved by the game in which an adversary picks the oracle and the set, the learner
picks an optimization procedure knowing $S$ (but not the oracle's workings), and then the adversary picks a
function that the learner must optimize. The two minimax errors can be defined formally as:

$\rho^*_T(\mathcal{F}) = \sup_{O \in \mathcal{O}} \sup_{S} \inf_{M \in \mathcal{M}_T} \sup_{f \in \mathcal{F}} E_O[\rho_T(M, f, S, O)]$   (1)

$\epsilon^*_T(\mathcal{F}) = \sup_{O \in \mathcal{O}} \sup_{S} \inf_{M \in \mathcal{M}_T} \sup_{f \in \mathcal{F}} E_O[\epsilon_T(M, f, S, O)]$   (2)

In this paper, we shall deal with $\epsilon^*_T(\mathcal{F})$ and $\rho^*_T(\mathcal{F})$ for different function
classes $\mathcal{F}$, and will be interested in how they scale with $T$. The dependence on $d$ may not be
optimal, but that will not be the concern of this work. We will instead try to give a fine characterization in
terms of their dependence on $T$ and $k$, a generalised uniform convexity parameter that will be introduced in
the next section.

3 Relating Uniform Convexity (UC) and Tsybakov Noise Condition (TNC) 

Given a closed, convex set $S \subset \mathbb{R}^d$, let $\mathcal{F}(S)$ be the set of all strictly convex
functions on $S$, i.e.,

$\forall x \neq y \in S,\ \forall t \in (0,1): \quad f(tx + (1-t)y) < t f(x) + (1-t) f(y)$

We consider this restriction only because it implies that $f$ has a unique minimum on any set $S$. We could
alternately define it to be the class of all convex functions that have a unique minimum in $S$.

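Let $\mathcal{F}^{UC}_{\lambda,k}(S)$ be the set of all $(\lambda, k)$-uniformly convex functions on $S$. $f$ is
$k$-UC on $S$ with UC parameter $\lambda$, or $f \in \mathcal{F}^{UC}_{\lambda,k}(S)$, for $k \geq 2$ (necessary
condition, Appendix), if we have

$\forall x, y \in S,\ \forall t \in [0,1]: \quad f(tx + (1-t)y) \leq t f(x) + (1-t) f(y) - \frac{\lambda}{2} t(1-t) \|x - y\|_2^k$

The case $k = 2$ is well known as the class of $\lambda$-strongly convex functions. As shown in [4], if a
uniformly convex function $f$ is subdifferentiable at $x$, then for any subgradient $g_x \in \partial f(x)$,

$f(y) \geq f(x) + g_x^\top (y - x) + \frac{\lambda}{2} \|x - y\|_2^k$

Taking $x$ to be $x^*_f$ and noting that $0 \in \partial f(x^*_f)$, we get

$f(y) - f(x^*_f) \geq \frac{\lambda}{2} \|y - x^*_f\|_2^k$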

Readers familiar with the classification literature will recognize the similarity of this last expression to the
Tsybakov Noise Condition (TNC) [?]. The TNC is often used in classification problems (given $x$, predict
its label $\ell(x)$) to describe the flatness of the regression function $\eta(x) = P(\ell(x) = 1 \mid x)$ around
the decision boundary $\eta(x) = 1/2$. For example, in one dimension, the following condition is assumed to hold
in some region around the decision boundary $x^*$, for some value of $k$:

$|\eta(x) - 1/2| \geq c |x - x^*|^{k-1}$

It describes how the signal to noise ratio (SNR) varies around the decision boundary $x^*$, which in turn
determines how easy or hard the classification problem is. For example, if $k = 1$, then $\eta(x)$ jumps by a
constant at the decision boundary, making it quite easy to identify. [2] show that one can derive minimax
rates for active classification that depend very precisely on $k$. A similar idea can be used to describe how a
function varies at the boundary in a level set estimation setting. One can ask a similar question in the convex
optimization setting: does the behaviour of a particular function around its minimum solely and precisely
determine rates for the given function's optimization?

Having established the clear connection between the notions of UC and TNC, we define the class
$\mathcal{F}^{GUC}_{L,k}(S)$ for any real $k > 1$ as the class of Generalised $(L, k)$-Uniformly Convex or
$(L, k)$-GUC functions, and we say that $f \in \mathcal{F}^{GUC}_{L,k}(S)$ if $f$ is strictly convex, i.e.,
$f \in \mathcal{F}(S)$, and

$\forall x \in S: \quad f(x) - f(x^*_f) \geq L \|x - x^*_f\|_2^k$

This can also be thought of as characterizing the SNR of the function near its minimum. Note that $k$
need not be an integer, and also that a function may lie in several $\mathcal{F}^{GUC}_{L,k}(S)$ classes. In
such a case, what is most relevant is the minimum $k$ for which the above condition holds, which is unique.

In this paper, we will show how the minimax errors $\epsilon^*_T(\mathcal{F}^{GUC}_{L,k})$ and
$\rho^*_T(\mathcal{F}^{GUC}_{L,k})$ can be cleanly and tightly specified, by providing tight lower and upper
bounds in terms of $T, k$. We thus argue that the optimal rate of minimizing such a function is determined
solely by its behaviour around its optimum.



Remark 1. Consider the function $f(x) = \|x\|_{1.5}^{1.5}$ in $S = [-2, 2]^d$. $f \in \mathcal{F}^{UC}_{\lambda,2}$
for some $\lambda$, because its second derivative is lower bounded on any closed set. This implies that
$f \in \mathcal{F}^{GUC}_{\lambda/2,\,2}$, but since our definition also allows $1 < k < 2$, we can also get
$f \in \mathcal{F}^{GUC}_{1,\,1.5}$, and this lower $k = 1.5$ allows us to get faster rates like
$O(T^{-3/2})$ than [4], who will get $O(T^{-1})$. (Appendix)
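The arithmetic behind the remark can be checked with a short sketch. The function names below are hypothetical; the exponents $k/(2k-2)$ and $1/(2k-2)$ are the ones quoted in the introduction for function-error and point-error respectively.

```python
def function_error_exponent(k):
    """Exponent r such that the function-error rate is T^(-r), for k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return k / (2 * k - 2)


def point_error_exponent(k):
    """Exponent r such that the point-error rate is T^(-r), for k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 / (2 * k - 2)


# k = 1.5 (allowed by GUC only), k = 2 (strongly convex), large k (nearly flat)
exponents = {k: function_error_exponent(k) for k in (1.5, 2, 1e9)}
```

For $k = 1.5$ the function-error exponent is $3/2$, for $k = 2$ it is $1$, and as $k$ grows it approaches $1/2$, matching the rates for strongly convex and convex functions.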



4 1-D Stochastic Optimization using Active Learning Algorithms 

In this section, we show how to reduce the task of stochastically optimizing a one-dimensional convex
function to that of active classification of signs of a monotone gradient. For simplicity of exposition, we
assume that the set $S$ of interest is $[0,1]$, and we have a convex function $f$ that achieves a unique minimum
$x^*$ inside the set $(0, 1)$.

We begin by noting that since $f$ is convex, its true gradient $g$ is an increasing function of $x$ that is
negative to the left of $x^*$, zero at $x^*$, and positive to the right of $x^*$. Hence, one can think of
$\mathrm{sign}(g(x))$ as being the label of point $x$, and finding $x^*$ corresponds to learning the decision
boundary.

In this section, we assume that the oracle returns gradient values corrupted by zero-mean gaussian noise $z$
with standard deviation $\sigma$. Because this noise is symmetric, when we query at a point to the left of
$x^*$ we are more likely to see a negative gradient (label $-1$), when we are at $x^*$ we have a 50-50 chance
of seeing a negative or positive gradient, and to the right of $x^*$ we have a greater chance of seeing a
positive gradient (label $1$). So, if we define $\eta(x) = P(\mathrm{sign}(g(x) + z) = 1 \mid x)$, then
minimizing the function corresponds to identifying the Bayes classifier $[x^*, 1]$. In other words, the point
at which $\eta(x) = 0.5$ is the point at which $g(x) = 0$, which is $x^*$.

We now argue that an assumption of $(L, k)$-GUC for $f$ implies that for any subgradient $g_x \in \partial f(x)$,
we have $\|g_x\|_2 \geq L \|x - x^*_f\|_2^{k-1}$ (Appendix), which then implies a $(c, k-1)$-TNC for $\eta$.
Here, we shall derive the second implication using the fact that the probability mass of a gaussian random
variable $z$ grows linearly just around its mean (Appendix), which can be stated as

$\forall t < \sigma,\ \exists a_1, a_2: \quad a_1 t \leq P(0 \leq z \leq t) \leq a_2 t$

Let us consider a point $x$ which is a distance $t > 0$ to the right of $x^*$ and hence has label 1 (we can
make a similar argument for $x < x^*$). As mentioned above, $\forall g_x \in \partial f(x)$, $g_x \geq L t^{k-1}$.
In the presence of gaussian noise, the probability of seeing label 1 is the probability that a draw gets a
value in $(-g_x, \infty)$ so that the sign is not reversed. This yields:

$\eta(x) = P(g_x + z > 0) = 0.5 + P(-g_x < z < 0) \geq 0.5 + a_1 L t^{k-1}$

$\Rightarrow \exists c > 0: \quad |\eta(x) - 1/2| \geq c |x - x^*|^{k-1}$
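A small numerical sketch of this implication, under stated assumptions: the test function is $f(x) = L|x - x^*|^k$ (so the gradient at distance $t$ is $kLt^{k-1}$), the noise is standard gaussian, and $a_1 = 1/(\sigma\sqrt{2\pi e})$ is the constant from the appendix lemma, valid while the gradient magnitude stays below $\sigma$. The names `eta` and `tnc_margin` are hypothetical.

```python
import math


def eta(g, sigma=1.0):
    """P(g + z > 0) for z ~ N(0, sigma^2), via the normal CDF."""
    return 0.5 * (1.0 + math.erf(g / (sigma * math.sqrt(2.0))))


def tnc_margin(t, L=1.0, k=1.5, sigma=1.0):
    """Return (eta(x) - 1/2, a_1 * L * t^(k-1)) at distance t right of x*."""
    g = k * L * t ** (k - 1)  # gradient of L|x - x*|^k at distance t
    a1 = 1.0 / (sigma * math.sqrt(2.0 * math.pi * math.e))
    return eta(g, sigma) - 0.5, a1 * L * t ** (k - 1)


# The margin should dominate the linearised lower bound for small t.
checks = [tnc_margin(t) for t in (0.01, 0.1, 0.3)]
```

Each pair in `checks` holds the true margin $\eta(x) - 1/2$ and the lower bound $a_1 L t^{k-1}$, so the TNC constant can be taken as $c = a_1 L$ in this toy case.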

[2] analyse an algorithm called the Burnashev-Zigangirov (BZ) algorithm, which is a noise-tolerant
variant of binary bisection, under such a TNC. BZ solves the one-dimensional active classification problem
such that after making $T$ queries for a noisy label, it returns a confidence interval $I_T$ which contains
$x^*$ with high probability, and $x_T$ is chosen to be the midpoint of $I_T$. They provide bounds for the excess
risk $\int_{[x_T, 1] \Delta [x^*, 1]} |2\eta(x) - 1| \, dx$, where $\Delta$ is the symmetric difference operator
over sets, but small modifications to their proofs yield a bound on $E|x_T - x^*|$. (Appendix)

The bounded noise setting of $k = 1$ is easy because the regression function is bounded away from half,
and we can show an exponential convergence of $E|x_T - x^*| = O(e^{-TL^2/2})$. The unbounded noise setting
of $k > 1$ is harder because the regression function does not jump; using a variant of BZ analysed in [2],
we can show that $E|x_T - x^*| = \tilde{O}(T^{-\frac{1}{2k-2}})$ and
$E[|x_T - x^*|^k] = \tilde{O}(T^{-\frac{k}{2k-2}})$. (Appendix)

If $f$ also obeys a Hölder condition with exponent $k$, $H|x - x^*|^k \geq f(x) - f(x^*) \geq L|x - x^*|^k$, we
can immediately get a bound $E[f(x_T) - f(x^*)] = \tilde{O}(T^{-\frac{k}{2k-2}})$. Interestingly, in the next
section on lower bounds, we show that for any dimension, $\Omega(T^{-\frac{1}{2k-2}})$ is the minimax
convergence rate for $E\|x_T - x^*\|_2$ and that $\Omega(T^{-\frac{k}{2k-2}})$ is the minimax rate for
$E[f(x_T) - f(x^*)]$.



5 Optimization Lower Bounds using Active Learning Techniques 

Here, we prove lower bounds for $\epsilon^*_T, \rho^*_T$ using an information theory technique that was
originally used for proving lower bounds for active classification using the TNC [2], providing a stronger
connection between active learning and stochastic convex optimization. Our main result is

Theorem 1. For $k > 1$, let $\mathcal{F}^{GUC}_{L,k}$ be the class of Generalised $(L, k)$-Uniformly Convex
functions on $\mathbb{R}^d$.

Then, the minimax rates for function-error and point-error are given by

$\epsilon^*_T(\mathcal{F}^{GUC}_{L,k}) = \Omega(T^{-\frac{k}{2k-2}}), \qquad \rho^*_T(\mathcal{F}^{GUC}_{L,k}) = \Omega(T^{-\frac{1}{2k-2}})$

The proof technique can be summarised as follows. We demonstrate an oracle $O^*$ and set $S^*$ over which
we prove a lower bound for $\inf_{M \in \mathcal{M}_T} \sup_{f \in \mathcal{F}} E_O[\epsilon_T(M, f, S, O)]$. We go
about this by defining a semi-distance between any two elements of our function class as the distance between
their minima. We then choose two very similar functions $f_0, f_1$ whose minima are $2a$ apart (think of $a$ as
a small constant, getting smaller with increasing $T$). The oracle chooses one of these two functions and the
learner gets to query at points $x$ in domain $S^*$, receiving noisy gradient and function values $y, z$. We
then define distributions $P^T_0, P^T_1$ corresponding to the two functions and choose $a$ so that they are at
most a constant KL-distance $\gamma$ apart. We then use the classical Fano's inequality which, using $a$ and
$\gamma$, lower bounds the probability of identifying the wrong function by any estimator (and hence optimizing
the wrong function), given any finite sample of size $T$.

Theorem 2. [?] Let $\mathcal{F}$ be a model class with an associated semi-distance
$\delta(\cdot, \cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ and each $f \in \mathcal{F}$ having an
associated measure $P^f$ on a common probability space. Let $f_0, f_1 \in \mathcal{F}$ be such that
$\delta(f_0, f_1) \geq 2a > 0$ and $KL(P^0 \| P^1) \leq \gamma$. Then,

$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} P^f(\delta(\hat{f}, f) \geq a) \geq \inf_{\hat{f}} \max_{j \in \{0,1\}} P^j(\delta(\hat{f}, f_j) \geq a) \geq \max\left( \frac{\exp(-\gamma)}{4}, \frac{1 - \sqrt{\gamma/2}}{2} \right)$
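To see how the right-hand side of Theorem 2 behaves, here is a minimal sketch that evaluates it for a given KL budget $\gamma$. The function name is hypothetical; the formula is the one stated in the theorem.

```python
import math


def two_point_lower_bound(gamma):
    """Lower bound on the error probability when KL(P^0 || P^1) <= gamma."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return max(math.exp(-gamma) / 4.0, (1.0 - math.sqrt(gamma / 2.0)) / 2.0)


# With identical distributions (gamma = 0) no estimator beats a coin flip;
# for large gamma only the exponential term remains positive.
bounds = [two_point_lower_bound(g) for g in (0.0, 0.5, 2.0)]
```

The bound stays strictly positive for every finite $\gamma$, which is what allows the proof to fix $\gamma$ as a constant and push all the dependence on $T$ into the separation $a$.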

For the set of generalised $(L, k)$-uniformly convex functions $\mathcal{F}^{GUC}_{L,k}$, we choose the set $S^*$
to be $[0, 1]^d$. The chosen oracle $O^*$ just adds standard normal noise to the true function and gradient
values. We first consider a subclass $\mathcal{U}^{GUC}_{L,k} \subset \mathcal{F}^{GUC}_{L,k}$, which is chosen
such that every point in $S^*$ is the minimizer of exactly one function in $\mathcal{U}^{GUC}_{L,k}$ (also,
$f \in \mathcal{U}^{GUC}_{L,k}$ has a unique minimum $x^*_f \in S^*$).

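We now bound $\inf_{\hat{x}_T} \sup_{f \in \mathcal{U}^{GUC}_{L,k}} E\|\hat{x}_T - x^*_f\|$. Define the
semi-distance $\delta(f_a, f_b) = \|x^*_a - x^*_b\|$ and let

$f_0(x) = d^{k/2} L \sum_{i=1}^d |x_i|^k, \qquad g_0(x) = k d^{k/2} L (x_1^{k-1}, \ldots, x_d^{k-1})$

$f_1(x) = d^{k/2} L \left( |x_1 - 2a|^k + \sum_{i=2}^d |x_i|^k + (4a)^k - (2a)^k \right)$ if $x_1 \in [0, 4a]$, and $f_1(x) = f_0(x)$ otherwise

$g_1(x) = k d^{k/2} L \left( \mathrm{sign}(x_1 - 2a)|x_1 - 2a|^{k-1}, x_2^{k-1}, \ldots, x_d^{k-1} \right)$ if $x_1 \in [0, 4a]$, and $g_1(x) = g_0(x)$ otherwise

for an appropriate value of $a$. The minima of these two functions are at $(0, 0, \ldots, 0)$ and
$(2a, 0, \ldots, 0)$ respectively, and hence $\delta(f_0, f_1) = 2a$. Notice that these two functions and their
gradients differ only on a set of size $4a$. The functions are both convex and both in
$\mathcal{F}^{GUC}_{L,k}$. (Appendix)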

On querying at point $x$, the oracle returns $z \sim N(f(x), \sigma^2)$ and $y \sim N(g(x), \sigma^2 I_d)$. So,
for $i \in \{0, 1\}$ we have $P^i(Z_t, Y_t \mid X = x_t) = N((f_i(x_t), g_i(x_t)), \sigma^2 I_{d+1})$. Let
$X_1^T, Y_1^T, Z_1^T$ be the random variables corresponding to the set of query points and responses. We define
the probability distribution corresponding to every $f \in \mathcal{U}^{GUC}_{L,k}$ to be their joint
distribution over $T$ samples, and so $P^T_0 := P^0(X_1^T, Y_1^T, Z_1^T)$ and
$P^T_1 := P^1(X_1^T, Y_1^T, Z_1^T)$.

The KL-divergence of these two distributions can be shown to be $KL(P^T_0 \| P^T_1) = O(d^k L^2 T a^{2k-2})$
(proof in Appendix). We choose $a = (d^k L^2 T)^{-\frac{1}{2k-2}}$ so that $\exists \gamma > 0$ with
$KL(P^T_0 \| P^T_1) \leq \gamma$.
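The choice of $a$ is exactly the one that cancels the $T$-dependence of the KL bound. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the constant hidden by the $O(\cdot)$ is taken to be 1 for illustration.

```python
def separation(d, k, L, T):
    """a = (d^k L^2 T)^(-1/(2k-2)), half the distance between the two minima."""
    return (d ** k * L ** 2 * T) ** (-1.0 / (2 * k - 2))


def kl_scale(d, k, L, T, a):
    """The quantity d^k L^2 T a^(2k-2) that controls the KL-divergence."""
    return d ** k * L ** 2 * T * a ** (2 * k - 2)


a = separation(d=3, k=1.5, L=2.0, T=1000)
scale = kl_scale(d=3, k=1.5, L=2.0, T=1000, a=a)
```

Here `scale` evaluates to 1 up to rounding for any $d, k, L, T$, so the KL-divergence stays bounded by a constant while $a$ shrinks like $T^{-\frac{1}{2k-2}}$.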

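Since we satisfy the conditions of the theorem, we get
$\inf_{\hat{f}} \sup_{f \in \mathcal{U}^{GUC}_{L,k}} P^f(\delta(\hat{f}, f) \geq a) \geq C$ for some
constant $C$. It immediately follows that

$\inf_{\hat{x}_T} \sup_{f \in \mathcal{U}^{GUC}_{L,k}} E\|\hat{x}_T - x^*_f\| \geq a \cdot \inf_{\hat{f}} \sup_{f \in \mathcal{U}^{GUC}_{L,k}} P^f(\delta(\hat{f}, f) \geq a) \geq a \cdot C = \Omega\left( (d^k L^2 T)^{-\frac{1}{2k-2}} \right)$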



where the first inequality follows by an application of Markov's inequality, the second follows by the
application of the aforementioned Fano's theorem, and the last step follows by the choice of $a$. This gives us
our required bound on $\rho^*_T(\mathcal{U}^{GUC}_{L,k})$, and correspondingly for
$\epsilon^*_T(\mathcal{U}^{GUC}_{L,k})$ because

$\inf_{\hat{x}_T} \sup_f E[f(\hat{x}_T) - f(x^*_f)] \geq \inf_{\hat{x}_T} \sup_f L \, E[\|\hat{x}_T - x^*_f\|^k] \geq \inf_{\hat{x}_T} \sup_f L \, [E\|\hat{x}_T - x^*_f\|]^k$

where the first inequality follows by the GUC condition, and the second follows by applying Jensen's inequality
for $k > 1$ and because a method $M_T$ returning a point $\hat{x}_T \in S^*$ corresponds exactly to it returning
a guessed function $\hat{f}_T \in \mathcal{U}^{GUC}_{L,k}$ (by construction of $\mathcal{U}^{GUC}_{L,k}$).

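Finally, we get the bounds on $\rho^*_T(\mathcal{F}^{GUC}_{L,k})$ and $\epsilon^*_T(\mathcal{F}^{GUC}_{L,k})$
because we are now taking the sup over the larger class
$\mathcal{F}^{GUC}_{L,k} \supset \mathcal{U}^{GUC}_{L,k}$. We carry the dimension dependence around to show
that we do not derive a contradiction to [ABRW10], who show that
$\epsilon^*_T(\mathcal{F}^{C}) = \Omega(\sqrt{d/T})$ and $\epsilon^*_T(\mathcal{F}^{SC}) = \Omega(d/T)$.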



6 Tight Upper Bounds using Generalised Epoch-GD 

In this section, we demonstrate an alternate algorithm to the dual-averaging one in [4] that gets the same
rates in expectation and with high probability. We consider the Epoch-GD algorithm from [3], which was
shown to be optimal when parameterized correctly for strongly convex optimization in expectation and in
high probability (up to logarithmic factors), and show that again, when parameterized correctly, the exact
same algorithm achieves our required rates, for $k > 1$.

We make the assumption that $f$ is known to be in $\mathcal{F}^{GUC}_{L,k}$ for some $k > 1$; [3] use $k = 2$,
while [4] need $k \geq 2$. We also assume (like [3], [4]) that the oracle always returns a subgradient estimate
$\hat{g}_x$ such that $E\hat{g}_x \in \partial f(x)$ and $\|\hat{g}_x\|_2 \leq G$. This assumption is like a
Lipschitz condition that also implies that the subgradient is bounded by $G$ everywhere. In fact, it implies
that the diameter is bounded by $(GL^{-1})^{\frac{1}{k-1}}$ and the function value is bounded by
$(G^k L^{-1})^{\frac{1}{k-1}}$ (Appendix).

The only difference in our EpochGD algorithm (described in Algorithm 1) from [3] is the update for $\eta_e$ (we
get back their update when $k = 2$). The algorithm runs for $E = \lfloor \log_2(\frac{T}{T_1} + 1) \rfloor$ epochs
so that the total number of gradient queries is bounded by $T$. The following theorems can be shown by mimicking
the proof in [3]. Note that, because $f \in \mathcal{F}^{GUC}_{L,k}$, we have
$\|x_T - x^*\| \leq L^{-1/k} [f(x_T) - f(x^*)]^{1/k}$, giving immediate bounds on the point-error of the final
guess $x_T = x_1^{E+1}$.

Theorem 3. There exists an algorithm EpochGD$(S, G, L, k, T)$ for any $f \in \mathcal{F}^{GUC}_{L,k}$,
$k \geq 2$, that after at most $T$ gradient queries, returns a point $x_T \in S$, such that
$E f(x_T) - f(x^*_f) = O(T^{-\frac{k}{2k-2}})$ and, by Jensen's inequality,

$E\|x_T - x^*_f\| \leq \left( E\|x_T - x^*_f\|^k \right)^{1/k} = O(T^{-\frac{1}{2k-2}})$.

The $k \geq 2$ condition seems to be an artifact of the analysis. We think the bound is true for all $k > 1$,
and we support this claim by demonstrating a high probability bound with no restriction on $k$, which
immediately implies the bound in expectation as well, for all $k > 1$.

To prove a high probability bound, [3] use a different projection operator, which now looks like
$\Pi_{S \cap B(x_1^e, r)}$, meaning they project (using a convex program, say) onto a convex set that is an
intersection of the original set and a ball centered at $x_1^e$ of radius $r$. They show how to set
$r, C_0, C_1$ in terms of $\lambda, G, T$ to get the minimax rate of $\tilde{O}(1/T)$. We also use exactly the
same projection, and show (Appendix) how to similarly set $r, C_0, C_1$ in terms of $L, k, G, T$ to obtain:

Theorem 4. There exists an algorithm EpochGDProj$(S, G, L, k, T, \delta)$ for any
$f \in \mathcal{F}^{GUC}_{L,k}$, $k > 1$, that after at most $T$ gradient queries, returns $x_T \in S$, such
that $f(x_T) - f(x^*_f) = \tilde{O}(T^{-\frac{k}{2k-2}})$ with probability at least $1 - \delta$ for any
$\delta > 0$, where $\tilde{O}$ hides $\log \log T$ and $\log(1/\delta)$ factors. Hence,
$\|x_T - x^*_f\| = \tilde{O}(T^{-\frac{1}{2k-2}})$ with probability at least $1 - \delta$.

Algorithm 1 EpochGD (set $S$, gradient bound $G$, time steps $T$, GUC parameters $k, L$)
Input: Constants $C_0, C_1, k$ and total time $T$.
Initialize $x_1^1 \in S$ arbitrarily
Initialize $T_1 = C_0$, $\eta_1 = C_1$ and $e = 1$
while $\sum_{i=1}^{e} T_i \leq T$ do
    for $t = 1$ to $T_e$ do
        Query the oracle at $x_t^e$ to obtain $\hat{g}_t$
        Update $x_{t+1}^e = \Pi_S(x_t^e - \eta_e \hat{g}_t)$
    end for
    Set $x_1^{e+1} = \frac{1}{T_e} \sum_{t=1}^{T_e} x_t^e$
    Set $T_{e+1} = 2T_e$, $\eta_{e+1} = \eta_e \cdot 2^{-\frac{k}{2k-2}}$ and $e \leftarrow e + 1$
end while
Output: $x_1^e$

The only assumptions are that we know $f \in \mathcal{F}^{GUC}_{L,k}$ for $k > 1$, and that we have a bound $G$
on any returned subgradient. [3] and [4] make the same assumptions, but [4] need $k \geq 2$, while [3] assume
$k = 2$, and hence don't get rates better than $O(1/T)$ like we do for $1 < k < 2$.
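The epoch structure can be sketched in a few lines of one-dimensional code. This is an illustration under stated assumptions rather than the tuned algorithm: the starting epoch length `T1` and step size `eta1` are placeholder values (not the constants $C_0, C_1$ from the analysis), the test function $|x - 0.3|^{1.5}$ on $[-1, 1]$ is hypothetical, and the gradient noise is bounded uniform noise so that the bounded-subgradient assumption holds.

```python
import random


def epoch_gd(grad_oracle, project, x0, T, k, T1=4, eta1=0.5):
    """Run a 1-D Epoch-GD sketch and return the final epoch average.

    T1 and eta1 are placeholder constants; k > 1 is the assumed GUC exponent.
    """
    shrink = 2.0 ** (-k / (2.0 * k - 2.0))  # step-size decay between epochs
    x, T_e, eta_e, used = x0, T1, eta1, 0
    while used + T_e <= T:                   # stay within the query budget
        y, total = x, 0.0
        for _ in range(T_e):
            total += y                       # accumulate the epoch's iterates
            y = project(y - eta_e * grad_oracle(y))
        x = total / T_e                      # next epoch starts at the average
        used += T_e
        T_e, eta_e = 2 * T_e, eta_e * shrink
    return x


rng = random.Random(0)
X_STAR, K = 0.3, 1.5


def noisy_grad(x):
    # Gradient of |x - X_STAR|^K plus bounded zero-mean noise
    d = x - X_STAR
    sign = (d > 0) - (d < 0)
    return K * sign * abs(d) ** (K - 1) + rng.uniform(-0.5, 0.5)


def clip(x):
    # Projection onto S = [-1, 1]
    return max(-1.0, min(1.0, x))


x_hat = epoch_gd(noisy_grad, clip, x0=-1.0, T=4000, k=K)
```

Setting `k=2` makes `shrink` equal to one half, which recovers the halving step size of the strongly convex case; smaller `k` shrinks the step size more aggressively from one epoch to the next.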

7 Discussion and Future Work 

The most common assumptions in the literature for proving convergence results for optimization algorithms
are those of convexity and $\lambda$-strong convexity, and [4] prove results for $(L, k)$-UC when $k \geq 2$.
The concept of $(L, k)$-GUC is a strictly weaker notion because it is immediately implied by $(L, k)$-UC in the
realm of $k \geq 2$, but has no corresponding notion when $1 < k < 2$. $k \to \infty$ corresponds to flatter and
flatter functions around their minima, while $k \to 1$ is actually the best case with a large SNR (called the
bounded noise case in Section 4, as done in [2]), and one can achieve extremely fast rates for this case, which
are surprisingly even faster than $O(1/T)$.

The lower bound $\Omega(T^{-\frac{k}{2k-2}})$ for $\epsilon^*$ that we prove doesn't contradict those in [1],
who show that their method gives the correct dependence on dimension $d$ as well, but we really wanted to show
how the rate decays with $T, k$. Setting $k = 2$ for strongly convex functions, we do recover the well-known
$\Omega(1/T)$ lower bound. Also, letting $k \to \infty$ gives the classic $\Omega(1/\sqrt{T})$ bound. We also
wanted to demonstrate how to use an active learning proof technique, which is novel in its application to
optimization, and we believe that it can be modified to give tight rates in $d$, with a better construction.

The lower bound $\Omega(T^{-\frac{1}{2k-2}})$ for $\rho^*$ is interesting because the optimization literature
does not often focus on point-error estimates. We note that these are strongly supported by intuition, as we can
see from the rate's behaviour at the extremes of $k$. If the function has $k \to 1$, it says that we should be
able to identify the optimum extremely fast, as supported by our result for the bounded noise setting in 1-D,
and also by the tight upper bounds for $\rho$ using Epoch-GD. However, when $k \to \infty$, the function starts
to look extremely flat around its minimum, and while we can optimize function-error very well (because a wide
range of points have function value close to the minimum value), we cannot expect to get close to the true
optimum point.

Our upper bounds on $\epsilon$ and $\rho$, when we know that the function is in $\mathcal{F}^{GUC}_{L,k}$,
involve an appropriately tuned Epoch-Gradient-Descent [3], and the rates match those of [4], who use a
dual-averaging algorithm, showing that the lower bounds achieved in terms of $T, k$ are indeed correct and
tight. It is important to note that we make the same kind of assumptions as [4] and [3]: the number of time
steps $T$, a bound on noisy subgradients $G$, a convexity parameter $L$, and knowledge of $k$ ($k = 2$ for [3],
any $k \geq 2$ for [4], any $k > 1$ for us). Substituting $k = 2$ in our algorithm yields their bound of
$O(1/T)$ for strongly convex functions (as well as the same parameter settings as in [3]). Also,
$k \to \infty$ recovers the classical rate of $O(1/\sqrt{T})$ for convex functions as well.

In practice, one does not usually know the smoothness of the function at hand, and hence what value
of $k$ to use in the proposed algorithm. Of course, if we only know that the function is convex then we can
use any gradient descent algorithm, and if we know that it is strongly convex then we can use $k = 2$, so our
algorithm is not any weaker than existing ones, but it is certainly stronger if we know $k$ accurately. [4] have
an algorithm in the non-stochastic setting that adapts to unknown $k$ with the loss of only a $\log T$ term in
the rate. However, in the stochastic setting, it is an open problem to construct an algorithm that can adapt to
unknown $k$.

Every function has a unique smallest k that it can possibly satisfy strong convexity with (because there 
is an inherent flatness to the function at its minimum, and we cannot satisfy GUC with a k smaller than 
that), and this k should be learnable with access to noisy function values and gradients. For example, if 
the function is simple, in the sense that it doesn't have different rates of smoothness in different areas, then 
perhaps we can spend half our budget of T queries just querying at a point and estimating its gradient in 
random directions around it to get a good estimate of k, and then run the algorithm using this estimate. Of 
course, it is also not clear how the algorithms perform when they use a wrong value for k (sensitivity of 
convergence rates to a possible estimation error in k). 

The lower bound proof proposed here is useful because it bounds $\epsilon^*$ and $\rho^*$ simultaneously, by
bounding the point-error $\rho$ and using the GUC condition to bound the function-error $\epsilon$. Also, notice
that the upper-bound proofs proceed by bounding the function-error $\epsilon$ and use the GUC condition to bound
the point-error $\rho$. We conjecture that the same proof should be alterable to get the right dependence on
$d, T, k$ simultaneously, using a larger set of functions (say associated with corners of a hypercube), each
function having its optimum perturbed in different dimensions (according to the 1s of its corner). Also, going
by the lower bounds of $\Omega(d/T)$ for $\mathcal{F}^{SC}$ and $\Omega(\sqrt{d/T})$ for $\mathcal{F}^{C}$, one
might guess that the right dependence on $d, T, k$ should look like $\Omega((d/T)^{\frac{k}{2k-2}})$.

Our upper bound proofs have a few loose ends. It should be an interesting exercise to get rid of the
$k \geq 2$ condition for the simpler expectation argument, and remove the $\log \log T$ in the high probability
argument. The $\log \log T$ factor also appears in the analysis by [3], so a tighter analysis in that setting
should immediately lend itself to improvements in our bounds. However, this is possibly a secondary concern
compared to learning or adapting to $k$ and getting the right dependence on $d, k$.

Hints of connections to active learning have been lingering in the literature, as noted by [6], but as far
as we know, nobody has explicitly used concepts, algorithms and proof techniques to connect the two fields
strongly. It is interesting to note, however, that while many active learning methods degrade exponentially
with dimension $d$, the rates in optimization degrade polynomially. This may limit the use of algorithms
from active learning, which are possibly trying to solve harder problems, like learning a $(d-1)$-dimensional
decision boundary or level set, while optimization problems in any dimension are really interested in getting
to a single good point. However, we feel that this is just the start of stronger conceptual ties between the two
fields.

SECTION 3

We now justify the claim that no function (including $f(x) = \|x\|_{1.5}^{1.5} = \sum_i |x_i|^{1.5}$) can
satisfy Uniform Convexity for $k < 2$, but they can satisfy Generalised Uniform Convexity for $k < 2$.
If uniform convexity could be satisfied for (say) $k = 1.5$, then $\forall x, y \in S$,

$f(y) - f(x) - g_x^\top (y - x) \geq \frac{\lambda}{2} \|x - y\|_2^{1.5}$

Take $x, y$ both on the positive $x$-axis. The Taylor expansion would require, for some $c \in [x, y]$,

$f(y) - f(x) - g_x^\top (y - x) = \frac{1}{2} (x - y)^\top H(c) (x - y) \leq \frac{1}{2} \|H(c)\|_F \|x - y\|_2^2$

Now, taking $\|x - y\|_2 = \epsilon \to 0$ by choosing $x$ closer to $y$, the Taylor condition requires the
residual to shrink like $\epsilon^2$ (going to zero fast), but the UC condition requires the residual to be at
least of order $\epsilon^{1.5}$ (going to zero slowly). At some small enough value of $\epsilon$, this would not
be possible. Since the definition of UC needs to hold for all $x, y \in S$, this gives us a contradiction. So,
$f \notin \mathcal{F}^{UC}_{\lambda, 1.5}$.

However, one can note that $x^*_f = 0$, and
$f(x) - f(x^*_f) = \|x\|_{1.5}^{1.5} = \|x - x^*_f\|_{1.5}^{1.5} \geq \|x - x^*_f\|_2^{1.5}$ (since
$\|\cdot\|_{1.5} \geq \|\cdot\|_2$), hence $f \in \mathcal{F}^{GUC}_{1, 1.5}$.

SECTION 4 

This section deals with reducing 1-D convex optimization to active learning of gradient signs.
Lemma 1. If $f \in \mathcal{F}^{GUC}_{L,k}$, then for any subgradient $g_x \in \partial f(x)$, we have
$\|g_x\|_2 \geq L \|x - x^*\|_2^{k-1}$.

Proof. By convexity, we have

$f(x^*) \geq f(x) + g_x^\top (x^* - x)$

Rearranging terms and since $f \in \mathcal{F}^{GUC}_{L,k}$, we get

$g_x^\top (x - x^*) \geq f(x) - f(x^*) \geq L \|x - x^*\|_2^k$

By Hölder's inequality,

$\|g_x\|_2 \|x - x^*\|_2 \geq g_x^\top (x - x^*)$

Putting them together, we have

$\|g_x\|_2 \|x - x^*\|_2 \geq L \|x - x^*\|_2^k$

giving us our result. □

Lemma 2. For a gaussian random variable $z \sim N(0, \sigma^2)$, $\forall t < \sigma$, $\exists a_1, a_2$ such
that $a_1 t \leq P(0 \leq z \leq t) \leq a_2 t$.

Proof. We wish to characterize how the probability mass of a gaussian random variable grows just around
its mean. Our claim is that it grows linearly with the distance from the mean, and the following simple
argument shows this.

Consider an $X \sim N(0, \sigma^2)$ random variable at a distance $t$ from the mean 0. We want to bound
$\int_{-t}^{t} d\mu(X)$ for very small $t$. The key idea in bounding this integral is to approximate it by a
smaller and a larger rectangle, each of the rectangles having a width $2t$ (from $-t$ to $t$).

The first one has a height equal to $\frac{e^{-t^2/2\sigma^2}}{\sigma\sqrt{2\pi}}$, the smallest value taken by
the gaussian density in $[-t, t]$, achieved at $t$, and the other has a height equal to
$\frac{1}{\sigma\sqrt{2\pi}}$, the largest value of the gaussian density in $[-t, t]$, achieved at 0.

The smaller rectangle has an area of
$2t \frac{e^{-t^2/2\sigma^2}}{\sigma\sqrt{2\pi}} \geq 2t \frac{1}{\sigma\sqrt{2\pi e}}$ when $t < \sigma$. The
larger rectangle clearly has an area of $2t \frac{1}{\sigma\sqrt{2\pi}}$.

Hence we have $A_1 t = 2t \frac{1}{\sigma\sqrt{2\pi e}} \leq P(|X| < t) \leq 2t \frac{1}{\sigma\sqrt{2\pi}} = A_2 t$
for $t < \sigma$. Similarly, for a one-sided inequality, we have
$a_1 t = t \frac{1}{\sigma\sqrt{2\pi e}} \leq P(0 < X < t) \leq t \frac{1}{\sigma\sqrt{2\pi}} = a_2 t$ for
$t < \sigma$.

We note that the gaussian tail inequality $P(X > t) \leq \frac{1}{t} e^{-t^2/2\sigma^2}$ really makes sense for
large $t > \sigma$, and we are interested in $t < \sigma$. There are tighter inequalities, but for our purpose,
this will suffice. □
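The rectangle argument is easy to sanity-check numerically. The sketch below computes $P(0 \leq X \leq t)$ exactly through the error function and compares it with the two linear bounds of Lemma 2; the function names are hypothetical.

```python
import math


def interval_mass(t, sigma=1.0):
    """P(0 <= X <= t) for X ~ N(0, sigma^2)."""
    return 0.5 * math.erf(t / (sigma * math.sqrt(2.0)))


def linear_bounds(t, sigma=1.0):
    """Return (a_1 * t, a_2 * t) with the constants from Lemma 2."""
    a1 = 1.0 / (sigma * math.sqrt(2.0 * math.pi * math.e))
    a2 = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return a1 * t, a2 * t


# Pairs (t, sigma) with t < sigma, as required by the lemma
cases = [(0.01, 1.0), (0.5, 1.0), (0.99, 1.0), (1.9, 2.0)]
results = [(linear_bounds(t, s), interval_mass(t, s)) for t, s in cases]
```

In every case the exact mass sits between the two linear bounds, and the gap between them is the constant factor $\sqrt{e}$.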

We now move to proving the key results claimed in the section. 

Lemma 3. If $|\eta(x) - 1/2| \ge L$, the midpoint $x_T$ of the high-probability interval returned by BZ satisfies $E|x_T - x^*| = O(e^{-TL^2/2})$.

Proof. This is a subpart of a proof from [CN07], who note that the BZ algorithm works by dividing $[0, 1]$ into a grid of $m$ points (interval size $1/m$) and makes $T$ queries (only at gridpoints) to return an interval $I_T$ such that $\Pr(x^* \notin I_T) \le m e^{-TL^2}$. We choose $x_T$ to be the midpoint of this interval, and hence get

$$
\begin{aligned}
E|x_T - x^*| &= \int_0^1 \Pr(|x_T - x^*| > u)\,du \\
&= \int_0^{1/2m} \Pr(|x_T - x^*| > u)\,du + \int_{1/2m}^1 \Pr(|x_T - x^*| > u)\,du \\
&\le \frac{1}{2m} + \left(1 - \frac{1}{2m}\right)\Pr\left(|x_T - x^*| > \frac{1}{2m}\right) \\
&\le \frac{1}{2m} + m e^{-TL^2} = O\left(e^{-TL^2/2}\right)
\end{aligned}
$$

for the choice of the number of gridpoints as $m = e^{TL^2/2}$. □
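The choice of $m$ balances the two terms of the bound. A minimal numerical sketch, with illustrative values of $T$ and $L$ that are assumptions made here (and ignoring that $m$ must be an integer in practice):

```python
import math

def bz_error_bound(m, T, L):
    """Upper bound 1/(2m) + m * exp(-T * L^2) on E|x_T - x*| from the proof."""
    return 1.0 / (2.0 * m) + m * math.exp(-T * L * L)

T, L = 200, 0.2                         # hypothetical query budget and noise margin
m_star = math.exp(T * L * L / 2.0)      # grid size chosen in the proof
rate = math.exp(-T * L * L / 2.0)       # claimed rate e^{-T L^2 / 2}

bound = bz_error_bound(m_star, T, L)
# With this m, the two terms are (1/2) * rate and rate respectively.
assert math.isclose(bound, 1.5 * rate, rel_tol=1e-9)
print(bound, 1.5 * rate)
```

Any other grid size makes one of the two terms larger than a constant multiple of $e^{-TL^2/2}$, which is why the bound is stated for this particular $m$.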

Lemma 4. If $|\eta(x) - 1/2| \ge L|x - x^*|^{\kappa-1}$, the point $x_T$ obtained from a modified version of BZ satisfies $E|x_T - x^*| = O\left(\left(\frac{\log T}{T}\right)^{\frac{1}{2\kappa-2}}\right)$ and $E[|x_T - x^*|^\kappa] = O\left(\left(\frac{\log T}{T}\right)^{\frac{\kappa}{2\kappa-2}}\right)$.

Proof. We again follow the same proof as in [CN07]. Initially, they assume that the grid points are not aligned with $x^*$, i.e. $\forall k \in \{0, \dots, m\}$, $|x^* - k/m| \ge 1/3m$. This implies that for all gridpoints $x$, $|\eta(x) - 1/2| \ge L(1/3m)^{\kappa-1}$. Following the exact same proof as above, with this new $L$,

$$
\begin{aligned}
E[|x_T - x^*|^\kappa] &= \int_0^1 \Pr(|x_T - x^*|^\kappa > u)\,du \\
&= \int_0^{(1/2m)^\kappa} \Pr(|x_T - x^*| > u^{1/\kappa})\,du + \int_{(1/2m)^\kappa}^1 \Pr(|x_T - x^*| > u^{1/\kappa})\,du \\
&\le \left(\frac{1}{2m}\right)^\kappa + \left(1 - \left(\frac{1}{2m}\right)^\kappa\right)\Pr\left(|x_T - x^*| > \frac{1}{2m}\right) \\
&\le \left(\frac{1}{2m}\right)^\kappa + m \exp\left(-TL^2 (1/3m)^{2\kappa-2}\right) = O\left(\left(\frac{\log T}{T}\right)^{\frac{\kappa}{2\kappa-2}}\right)
\end{aligned}
$$

on choosing $m$ proportional to $\left(\frac{T}{\log T}\right)^{\frac{1}{2\kappa-2}}$.

In [CN07], they elaborate in detail how to get rid of the assumption that the grid points don't align with $x^*$. They use a more complicated variant of BZ with three interlocked grids, and get the same rate as above without that assumption. The reader is directed to their exposition for clarification. □



SECTION 5 

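Lemma 5. $f_0(x) := d^{\kappa/2} L \sum_{i=1}^d |x_i|^\kappa \in \mathcal{F}^\kappa$, for all $\kappa > 1$.

Proof. Firstly, this is a sum of convex functions and is hence convex. Also, $f_0(x^*_{f_0}) = 0$ at $x^*_{f_0} = 0$. So, all we need to show is that $f_0(x) - f_0(x^*_{f_0}) \ge L\|x - x^*_{f_0}\|_2^\kappa$, or in other words

$$d^{\kappa/2} L \sum_{i=1}^d |x_i|^\kappa \ge L\|x\|_2^\kappa \iff \left(\sum_{i=1}^d |x_i|^\kappa\right)^{1/\kappa} \ge \frac{\|x\|_2}{\sqrt{d}},$$

which is true since if $\kappa \le 2$, we know $\|x\|_\kappa \ge \|x\|_2$; and as $\kappa \to \infty$, $\|x\|_\infty \ge \|x\|_2/\sqrt{d}$.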
$f_1$ is continuous at $x_1 = 4a$ (in fact the constants were chosen that way). The gradient at $x_1 = 4a$ increases from $\kappa d^{\kappa/2} L(2a)^{\kappa-1}$ to $\kappa d^{\kappa/2} L(4a)^{\kappa-1}$. Hence convexity is preserved at the kink. Since both parts of $f_1$ are convex, and convexity is maintained at the kink, we conclude that $f_1$ is convex.

Now, look at $f_1(x)$ for $x_1 \le 4a$. It is actually just $f_0(x)$, but translated by $2a$ in direction $x_1$, with a constant added, and hence has the same GUC parameters. Now, the part with $x_1 > 4a$ is just $f_0(x)$ itself, which has the same GUC parameters as the part with $x_1 \le 4a$. So $f_1(x) \in \mathcal{F}^\kappa$ also. □

Now we bound the KL divergence of the two probability distributions $P^i(Z_t, Y_t \mid X = x_t) = N\left((f_i(x_t), g_i(x_t)), \sigma^2 I_{d+1}\right)$ and $P^i_T := P^i(X_1^T, Y_1^T, Z_1^T)$ (for $i = 0, 1$).



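Lemma 6. For $P^0_T, P^1_T$ as defined in terms of $f_0, f_1$, $KL(P^0_T, P^1_T) = O(Ta^{2\kappa-2})$.

Proof.

$$
\begin{aligned}
KL(P^0_T, P^1_T) &= E^0\left[\log \frac{P^0(X_1^T, Y_1^T, Z_1^T)}{P^1(X_1^T, Y_1^T, Z_1^T)}\right] \\
&= E^0\left[\log \frac{\prod_{t=1}^T P^0(Y_t, Z_t \mid X_t)\, P(X_t \mid X_1^{t-1}, Y_1^{t-1}, Z_1^{t-1})}{\prod_{t=1}^T P^1(Y_t, Z_t \mid X_t)\, P(X_t \mid X_1^{t-1}, Y_1^{t-1}, Z_1^{t-1})}\right] && (3) \\
&= E^0\left[\log \frac{\prod_{t=1}^T P^0(Y_t, Z_t \mid X_t)}{\prod_{t=1}^T P^1(Y_t, Z_t \mid X_t)}\right] \\
&= \sum_{t=1}^T E^0\left[E^0\left[\log \frac{P^0(Y_t, Z_t \mid X_t)}{P^1(Y_t, Z_t \mid X_t)} \,\Big|\, X_1, \dots, X_T\right]\right] \\
&\le T \max_{x \in [0,1]^d} E^0\left[\log \frac{P^0(Y_1, Z_1 \mid X_1)}{P^1(Y_1, Z_1 \mid X_1)} \,\Big|\, X_1 = x\right] \\
&= T \max_{x \in [0,1]^d} E^0\left[\log \frac{P^0(Y_1 \mid X_1)\, P^0(Z_1 \mid X_1)}{P^1(Y_1 \mid X_1)\, P^1(Z_1 \mid X_1)} \,\Big|\, X_1 = x\right] && (4) \\
&\le T \max_{x \in [0,1]^d} E^0\left[\log \frac{P^0(Y_1 \mid X_1)}{P^1(Y_1 \mid X_1)} \,\Big|\, X_1 = x\right] + T \max_{x \in [0,1]^d} E^0\left[\log \frac{P^0(Z_1 \mid X_1)}{P^1(Z_1 \mid X_1)} \,\Big|\, X_1 = x\right] \\
&= \frac{T}{2\sigma^2}\left(\max_{x \in [0,1]^d} \|g_0(x) - g_1(x)\|_2^2 + \max_{x \in [0,1]^d} (f_0(x) - f_1(x))^2\right) && (5) \\
&= \frac{T}{2\sigma^2}\left(\kappa^2 d^\kappa L^2 \max_{x_1 \in [0,4a]} \left(\mathrm{sign}(x_1 - 2a)\,|x_1 - 2a|^{\kappa-1} - x_1^{\kappa-1}\right)^2 + \max_{x_1 \in [0,4a]} (f_0(x) - f_1(x))^2\right) && (6) \\
&= O(d^\kappa L^2 T a^{2\kappa-2}) + O(d^\kappa L^2 T a^{2\kappa}) = O(d^\kappa L^2 T a^{2\kappa-2}) && (7)
\end{aligned}
$$

(3) follows because the distribution of $X_t$ conditional on $X_1^{t-1}, Y_1^{t-1}, Z_1^{t-1}$ depends only on the algorithm and does not change with the underlying distribution. (4) follows because conditioned on $X_t$, $Y_t \perp Z_t$. We also used $(Y_i, Z_i \mid X_i) \perp (Y_j, Z_j \mid X_j)$ for $i \ne j$. (5) follows because the KL divergence between two Gaussians with the same covariance $\sigma^2 I$ is just the squared Euclidean distance between their means divided by $2\sigma^2$. (6) follows by simply substituting the gradient/function values and because the functions/gradients differ only on $x_1 \in [0, 4a]$. (7) follows by checking values at $0, 2a, 4a$, and the smaller power is of larger order since $a < 1$, treating $\kappa$ and $\sigma$ as constants.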
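The $a^{2\kappa-2}$ scaling of the gradient term can be checked on a grid. This is a sketch under assumptions: the definition of $f_1$ is not reproduced in this section, so the code assumes only what the text states, namely that on $x_1 \in [0, 4a]$ the first gradient coordinates of $f_0$ and $f_1$ are proportional to $x_1^{\kappa-1}$ and $\mathrm{sign}(x_1 - 2a)|x_1 - 2a|^{\kappa-1}$. The common prefactor is dropped and the function name is hypothetical.

```python
def max_sq_gradient_gap(a, kappa, grid=4001):
    """Grid estimate of max over x1 in [0, 4a] of
    (x1^(kappa-1) - sign(x1 - 2a) * |x1 - 2a|^(kappa-1))^2."""
    best = 0.0
    for i in range(grid):
        x1 = 4.0 * a * i / (grid - 1)
        shifted = x1 - 2.0 * a
        sign = (shifted > 0) - (shifted < 0)
        gap = x1 ** (kappa - 1) - sign * abs(shifted) ** (kappa - 1)
        best = max(best, gap * gap)
    return best

kappa = 3.0  # illustrative exponent
for a in (0.1, 0.05, 0.025):
    ratio = max_sq_gradient_gap(a, kappa) / a ** (2 * kappa - 2)
    print(a, ratio)  # the ratio stays (numerically) constant as a shrinks
```

Because the gap is homogeneous of degree $\kappa - 1$ in $a$, the printed ratio does not depend on $a$, which is the content of the $O(a^{2\kappa-2})$ claim for the gradient term.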



SECTION 6 



We begin by showing that the assumption of $f$ having a bounded subgradient on $S$ corresponds to assuming a bound on the diameter of $S$, and hence on the maximum achievable function value.

Lemma 7. If $f \in \mathcal{F}^\kappa$ with $\|\nabla f(x)\|_2 \le G$, then $\forall x \in S$, we have $\|x - x^*_f\|_2 \le (GL^{-1})^{\frac{1}{\kappa-1}} =: D$ (diameter) and also that $f(x) - f(x^*_f) \le (G^\kappa L^{-1})^{\frac{1}{\kappa-1}} =: M$ (maximum).

Proof. This follows the corresponding proof in [HK11]. For any $x \in S$, let $g_x \in \partial f(x)$. By convexity, $f(x^*_f) \ge f(x) + g_x^\top (x^*_f - x)$, and so $f(x) - f(x^*_f) \le g_x^\top (x - x^*_f) \le \|g_x\|_2 \|x - x^*_f\|_2$ (by Hölder's inequality), implying that $G\|x - x^*_f\|_2 \ge f(x) - f(x^*_f) \ge L\|x - x^*_f\|_2^\kappa$.

From this we get $\|x - x^*_f\|_2^{\kappa-1} \le G/L$, or $\|x - x^*_f\|_2 \le G^{\frac{1}{\kappa-1}}/L^{\frac{1}{\kappa-1}}$. Finally, $f(x) - f(x^*_f) \le G\|x - x^*_f\|_2 \le G^{\frac{\kappa}{\kappa-1}}/L^{\frac{1}{\kappa-1}}$. Note that for strongly convex functions, $\kappa = 2$, and [HK11] observe that $f(x) - f(x^*) \le G^2/L$ and $\|x - x^*\|_2 \le G/L$. □
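As a quick sanity check of the two constants, the following sketch evaluates $D$ and $M$ and confirms the strongly convex special case. The function name and the numeric values are assumptions chosen for illustration.

```python
def diameter_and_max(G, L, kappa):
    """Return (D, M) from Lemma 7: D = (G/L)^(1/(kappa-1)), M = (G^kappa/L)^(1/(kappa-1))."""
    D = (G / L) ** (1.0 / (kappa - 1.0))
    M = (G ** kappa / L) ** (1.0 / (kappa - 1.0))
    return D, M

G, L = 3.0, 0.5  # hypothetical subgradient bound and GUC constant
D, M = diameter_and_max(G, L, kappa=2.0)
assert abs(D - G / L) < 1e-12        # kappa = 2 recovers D = G / L
assert abs(M - G ** 2 / L) < 1e-12   # kappa = 2 recovers M = G^2 / L

D3, M3 = diameter_and_max(G, L, kappa=3.0)
assert abs(M3 - G * D3) < 1e-9       # M = G * D, as in the last step of the proof
```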

Lemma 8. [HK11] Applying $T$ iterations of the update $x_{t+1} = \Pi_S(x_t - \eta \hat{g}_t)$, where $\hat{g}_t$ is an unbiased estimator for a subgradient $g_t$ of convex function $f$ at $x_t$ that satisfies $\|\hat{g}_t\| \le G$, we get the following bound for $\bar{x} = \frac{1}{T}\sum_t x_t$:

$$E f(\bar{x}) - f(x^*_f) \le \frac{\eta G^2}{2} + \frac{\|x_1 - x^*_f\|^2}{2\eta T}.$$

Define $\Delta_e = f(x_1^e) - f(x^*_f)$. The corresponding proof in [HK11] shows $E\Delta_e \le 2G^2\eta_e$. We point out that the bound for $\Delta_{E+1}$ with $E = \lfloor \log(\frac{T}{C_0} + 1) \rfloor$ yields Theorem 3 immediately.
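The update in Lemma 8 can be sketched in a few lines. This is a one-dimensional illustration under assumptions made here, not the algorithm of [HK11]: it uses exact subgradients of $f(x) = |x|$ on $S = [-1, 1]$ (a zero-noise special case of an unbiased estimator, with $G = 1$), so the bound of the lemma holds deterministically and can be asserted. All function names are hypothetical.

```python
def projected_subgradient_average(subgrad, project, x1, eta, T):
    """Run T steps of x_{t+1} = project(x_t - eta * g_t); return the average of x_1..x_T."""
    x, running_sum = x1, 0.0
    for _ in range(T):
        running_sum += x
        x = project(x - eta * subgrad(x))
    return running_sum / T

def abs_subgrad(x):
    """A subgradient of f(x) = |x| (taking 0 at the kink)."""
    return float((x > 0) - (x < 0))

def clip_to_unit_interval(x):
    """Euclidean projection onto S = [-1, 1]."""
    return max(-1.0, min(1.0, x))

G, x_star, x1, eta, T = 1.0, 0.0, 1.0, 0.1, 100
x_bar = projected_subgradient_average(abs_subgrad, clip_to_unit_interval, x1, eta, T)
bound = eta * G ** 2 / 2 + (x1 - x_star) ** 2 / (2 * eta * T)
assert abs(x_bar) - abs(x_star) <= bound  # f(x_bar) - f(x*) <= eta G^2 / 2 + |x_1 - x*|^2 / (2 eta T)
```

With noisy gradient estimates the same inequality is only claimed in expectation, which is the form used in the epoch analysis.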

Lemma 9. For any $e$, with $T_e = C_0 2^e$, $\eta_e = C_1 \cdot 2^{-e\frac{\kappa}{2\kappa-2}}$, for appropriate $C_0, C_1, C_2$, we have

$$E\Delta_e \le C_2 \eta_e.$$

Proof. We choose $C_0 = 1$ and prove the lemma by performing induction on $e$.

The first step of the induction, for $e = 1$, requires $\Delta_1 \le C_2\eta_1 = C_2 C_1 2^{-\frac{\kappa}{2\kappa-2}}$. [R1]

Assume the hypothesis is true until $e$, and we prove it for $e+1$. Let $E_e$ denote the expectation conditioned on the randomness until epoch $e$. By Lemma 8,

$$E_e[\Delta_{e+1}] \le \frac{\eta_e G^2}{2} + \frac{\|x_1^e - x^*\|^2}{2\eta_e T_e} \le \frac{\eta_e G^2}{2} + \frac{\Delta_e^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}},$$

where the second inequality follows because $\Delta_e \ge L\|x_1^e - x^*\|^\kappa$.

Now taking expectation over all epochs up to $e$, and applying Jensen's inequality for $\kappa \ge 2$, we get

$$E[\Delta_{e+1}] \le \frac{\eta_e G^2}{2} + \frac{(E\Delta_e)^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}} \le \frac{\eta_e G^2}{2} + \frac{C_2^{2/\kappa}\eta_e^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}},$$

where the second inequality follows by the induction hypothesis.

Now, we would like the second term $\frac{C_2^{2/\kappa}\eta_e^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}} \le \frac{\eta_e G^2}{2}$, so that their sum is $\le \eta_e G^2$. [R2]
We now want the sum of the two terms $\eta_e G^2 \le C_2\eta_{e+1}$, so that the induction can go through. [R3]
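We will now show values for $C_1$ and $C_2$ for which all 3 conditions hold (we chose $C_0 = 1$).
From [R3], we derive $C_2 \ge G^2 \frac{\eta_e}{\eta_{e+1}} = G^2 2^{\frac{\kappa}{2\kappa-2}}$, giving a lower bound for $C_2$.

From [R2], we derive $\eta_e^{2 - 2/\kappa} \ge \frac{C_2^{2/\kappa}}{G^2 L^{2/\kappa} T_e} \iff \eta_e \ge \frac{C_2^{\frac{1}{\kappa-1}}}{G^{\frac{\kappa}{\kappa-1}} L^{\frac{1}{\kappa-1}}} \times 2^{-e\frac{\kappa}{2\kappa-2}}$. Since $\eta_e = C_1 \cdot 2^{-e\frac{\kappa}{2\kappa-2}}$, we get $C_1 \ge \frac{C_2^{\frac{1}{\kappa-1}}}{G^{\frac{\kappa}{\kappa-1}} L^{\frac{1}{\kappa-1}}} \ge \frac{G^{\frac{2-\kappa}{\kappa-1}}\, 2^{\frac{\kappa}{2(\kappa-1)^2}}}{L^{\frac{1}{\kappa-1}}}$, giving a lower bound for $C_1$ using $C_2$.

For [R1] to hold, note that its right hand side is $C_2\eta_1 = C_2 C_1 2^{-\frac{\kappa}{2\kappa-2}} \ge G^2 2^{\frac{\kappa}{2\kappa-2}} \cdot \frac{G^{\frac{2-\kappa}{\kappa-1}}\, 2^{\frac{\kappa}{2(\kappa-1)^2}}}{L^{\frac{1}{\kappa-1}}} \cdot 2^{-\frac{\kappa}{2\kappa-2}} = \frac{G^{\frac{\kappa}{\kappa-1}}}{L^{\frac{1}{\kappa-1}}} \cdot 2^{\frac{\kappa}{2(\kappa-1)^2}} = M \cdot 2^{\frac{\kappa}{2(\kappa-1)^2}}$. Hence, [R1] requires us to show that $\Delta_1$ is smaller than the RHS, which is at least as large as $M \cdot 2^{\frac{\kappa}{2(\kappa-1)^2}}$. Since $\kappa > 1$, this is trivially true, since we already know that $\Delta_1 \le M$ (Lemma 7).

Hence, we have proved the lemma for the constants $C_0 = 1$, $C_1 = \frac{G^{\frac{2-\kappa}{\kappa-1}}\, 2^{\frac{\kappa}{2(\kappa-1)^2}}}{L^{\frac{1}{\kappa-1}}}$, $C_2 = G^2 2^{\frac{\kappa}{2\kappa-2}}$. For the result in [HK11], $C_0 = 1$, $C_1 = 2/L$, $C_2 = 2G^2$ works, which we get with $\kappa = 2$. □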
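The three conditions can be verified numerically for the stated constants. The sketch below is an illustration with hypothetical function names and parameter values. It takes $C_1$ and $C_2$ at their lower bounds, so [R2] and [R3] hold with equality and a small relative tolerance is used for floating point.

```python
def lemma9_constants(G, L, kappa):
    """Constants (C0, C1, C2) taken at the lower bounds derived from [R2] and [R3]."""
    r = kappa / (2.0 * kappa - 2.0)
    C0 = 1.0
    C2 = G ** 2 * 2.0 ** r
    C1 = (C2 ** (2.0 / kappa) / (G ** 2 * L ** (2.0 / kappa))) ** r
    return C0, C1, C2

def check_conditions(G, L, kappa, epochs=10, tol=1e-9):
    """Check [R1] (via Delta_1 <= M), [R2] and [R3] for the first few epochs."""
    C0, C1, C2 = lemma9_constants(G, L, kappa)
    r = kappa / (2.0 * kappa - 2.0)
    M = (G ** kappa / L) ** (1.0 / (kappa - 1.0))

    def eta(e):
        return C1 * 2.0 ** (-e * r)

    ok = M <= C2 * eta(1) * (1 + tol)                                   # [R1]
    for e in range(1, epochs + 1):
        T_e = C0 * 2.0 ** e
        second = (C2 * eta(e)) ** (2.0 / kappa) / (2 * eta(e) * T_e * L ** (2.0 / kappa))
        ok = ok and second <= eta(e) * G ** 2 / 2 * (1 + tol)           # [R2]
        ok = ok and eta(e) * G ** 2 <= C2 * eta(e + 1) * (1 + tol)      # [R3]
    return ok

assert check_conditions(G=2.0, L=0.5, kappa=3.0)
C0, C1, C2 = lemma9_constants(G=3.0, L=0.5, kappa=2.0)
print(C0, C1, C2)  # kappa = 2 gives C1 = 2/L and C2 = 2 G^2
```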

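Lemma 10. [HK11] Let $R$ be an upper bound on $\|x_1 - x^*_f\|$. Apply $T$ iterations of the update $x_{t+1} = \Pi_{S \cap B(x_1, R)}(x_t - \eta \hat{g}_t)$, where $\hat{g}_t$ is an unbiased estimator for the subgradient of convex $f$ at $x_t$ satisfying $\|\hat{g}_t\| \le G$. For $\bar{x} = \frac{1}{T}\sum_t x_t$ and any $\delta > 0$, with probability at least $1 - \delta$, we have

$$f(\bar{x}) - f(x^*_f) \le \frac{\eta G^2}{2} + \frac{\|x_1 - x^*_f\|^2}{2\eta T} + \frac{4GR\sqrt{2\log(1/\delta)}}{\sqrt{T}}.$$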
Define $\Delta_e = f(x_1^e) - f(x^*_f)$. The corresponding proof in [HK11] shows $\Delta_e \le 6G^2\eta_e$ with probability at least $(1 - \frac{\delta}{E})^{e-1}$. We point out that the bound for $\Delta_{E+1}$ with $E = \lfloor \log(\frac{T}{C_0} + 1) \rfloor$ yields Theorem 4 immediately by noting that $(1 - \frac{\delta}{E})^E \ge 1 - \delta$.

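Lemma 11. For any epoch $e$ and any $\delta > 0$, with $T_e = C_0 2^e$, $E = \lfloor \log(\frac{T}{C_0} + 1) \rfloor$, $\eta_e = C_1 2^{-e\frac{\kappa}{2\kappa-2}}$, for an appropriate choice of $C_0, C_1, C_2$, we have with probability at least $(1 - \frac{\delta}{E})^{e-1}$

$$\Delta_e \le C_2\eta_e.$$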

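Proof. We let $\tilde{\delta} = \frac{\delta}{E}$ and prove the lemma by induction on $e$.

The first step of induction, $e = 1$, requires $\Delta_1 \le C_2\eta_1 = C_2 C_1 2^{-\frac{\kappa}{2\kappa-2}}$. [R1]

Assume that $\Delta_e \le C_2\eta_e$ for some $e \ge 1$, with probability at least $(1 - \tilde{\delta})^{e-1}$, and we now prove it for epoch $e + 1$. We condition on the event $\Delta_e \le C_2\eta_e$, which happens with the above probability. By the GUC, $\Delta_e \ge L\|x_1^e - x^*\|^\kappa$, and the conditioning implies that $\|x_1^e - x^*\| \le (C_2\eta_e/L)^{1/\kappa}$, which we choose as the radius $R$ of the ball for the EpochGDProj projection.

Lemma 10 applies with $R = \left(\frac{C_2\eta_e}{L}\right)^{1/\kappa}$, and so with probability at least $1 - \tilde{\delta}$, we have

$$\Delta_{e+1} \le \frac{\eta_e G^2}{2} + \frac{\|x_1^e - x^*\|^2}{2\eta_e T_e} + \frac{4G\left(\frac{C_2\eta_e}{L}\right)^{1/\kappa}\sqrt{2\log(1/\tilde{\delta})}}{\sqrt{T_e}} \le \frac{\eta_e G^2}{2} + \frac{C_2^{2/\kappa}\eta_e^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}} + \frac{4G\left(\frac{C_2\eta_e}{L}\right)^{1/\kappa}\sqrt{2\log(1/\tilde{\delta})}}{\sqrt{T_e}},$$

where the second inequality again follows because $\|x_1^e - x^*\| \le (C_2\eta_e/L)^{1/\kappa}$.

We would now like the second term $\frac{C_2^{2/\kappa}\eta_e^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}} \le \frac{\eta_e G^2}{6}$ [R2] and also the third term $\frac{4G\left(\frac{C_2\eta_e}{L}\right)^{1/\kappa}\sqrt{2\log(1/\tilde{\delta})}}{\sqrt{T_e}} \le \frac{\eta_e G^2}{3}$ [R3], so the sum of all three terms would be $\le \eta_e G^2$.

Lastly, we would like $\eta_e G^2 \le C_2\eta_{e+1}$ [R4] so that the induction goes through.
Then, factoring in the conditioned event, which happens with probability at least $(1 - \tilde{\delta})^{e-1}$, we would get $\Delta_{e+1} \le C_2\eta_{e+1}$ with probability at least $(1 - \tilde{\delta})^e$.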

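Now, we show values for $C_0, C_1, C_2$ such that the four conditions hold.

For [R4], we need $\eta_e G^2 \le C_2\eta_{e+1}$, and hence we get $C_2 \ge G^2 2^{\frac{\kappa}{2\kappa-2}}$, a lower bound for $C_2$.

For [R2], we need $\frac{\eta_e G^2}{6} \ge \frac{C_2^{2/\kappa}\eta_e^{2/\kappa}}{2\eta_e T_e L^{2/\kappa}} \iff \eta_e^{2 - 2/\kappa} \ge \frac{3C_2^{2/\kappa}}{G^2 L^{2/\kappa} C_0} 2^{-e}$, from which we get that $\eta_e \ge \left(\frac{3C_2^{2/\kappa}}{G^2 L^{2/\kappa} C_0}\right)^{\frac{\kappa}{2\kappa-2}} 2^{-e\frac{\kappa}{2\kappa-2}}$, i.e. $C_1 \ge \left(\frac{3C_2^{2/\kappa}}{G^2 L^{2/\kappa} C_0}\right)^{\frac{\kappa}{2\kappa-2}}$ since $\eta_e = C_1 2^{-e\frac{\kappa}{2\kappa-2}}$.

For [R3], we need $\frac{\eta_e G^2}{3} \ge \frac{4G\left(\frac{C_2\eta_e}{L}\right)^{1/\kappa}\sqrt{2\log(1/\tilde{\delta})}}{\sqrt{T_e}} \iff \eta_e^{2 - 2/\kappa} \ge \frac{3(96\log(1/\tilde{\delta}))\,C_2^{2/\kappa}}{G^2 L^{2/\kappa} C_0} 2^{-e}$, which yields $C_1 \ge \left(\frac{3(96\log(1/\tilde{\delta}))\,C_2^{2/\kappa}}{G^2 L^{2/\kappa} C_0}\right)^{\frac{\kappa}{2\kappa-2}}$ since $\eta_e = C_1 2^{-e\frac{\kappa}{2\kappa-2}}$. This is the stronger condition on $C_1$.

For [R1] to hold, we note that its right hand side is $C_2\eta_1 = C_1 C_2 2^{-\frac{\kappa}{2\kappa-2}} \ge G^2 2^{\frac{\kappa}{2\kappa-2}} \left(\frac{3(96\log(1/\tilde{\delta}))\,C_2^{2/\kappa}}{G^2 L^{2/\kappa} C_0}\right)^{\frac{\kappa}{2\kappa-2}} 2^{-\frac{\kappa}{2\kappa-2}} = \frac{G^{\frac{\kappa}{\kappa-1}}}{L^{\frac{1}{\kappa-1}}} \left(\frac{288\log(1/\tilde{\delta})}{C_0}\right)^{\frac{\kappa}{2\kappa-2}} 2^{\frac{\kappa}{2(\kappa-1)^2}} = M 2^{\frac{\kappa}{2(\kappa-1)^2}}$ if $C_0 = 288\log(1/\tilde{\delta})$. So, [R1] requires $\Delta_1$ to be smaller than the RHS, which is at least $M 2^{\frac{\kappa}{2(\kappa-1)^2}}$, which is trivially true since $\Delta_1 \le M$ (Lemma 7) and $\kappa > 1$.

Hence, we have proved the lemma for $C_0 = 288\log(E/\delta)$, $C_1 = \frac{G^{\frac{2-\kappa}{\kappa-1}}\, 2^{\frac{\kappa}{2(\kappa-1)^2}}}{L^{\frac{1}{\kappa-1}}}$, $C_2 = G^2 2^{\frac{\kappa}{2\kappa-2}}$.

[HK11] use $C_0 = 288\log(E/\delta)$, $C_1 = 2/L$, $C_2 = 2G^2$, which we get with $\kappa = 2$.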

As with [HK11], because of the changed $T_1$, they lose a factor of $\log\log T$, because the total number of epochs is now smaller. Another way of seeing this, as in [HK11], is to allow the total number of epochs to be $E = \lfloor \log(\frac{T}{288} + 1) \rfloor$ instead of $E = \lfloor \log(\frac{T}{C_0} + 1) \rfloor$ with $C_0 = 288\log(E/\delta)$. Then, the algorithm runs for $T\log\log T$ steps, to give a bound of $O(1/T)$ with high probability. We can easily reverse this to show that the algorithm runs for $T$ steps, to give a bound of $O(\log\log T/T)$ with high probability. □
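To make the epoch structure concrete, the following sketch lists the epoch lengths $T_e = C_0 2^e$ and step sizes $\eta_e = C_1 2^{-e\frac{\kappa}{2\kappa-2}}$ that fit inside a query budget $T$. The function name, the rounding of $T_e$ up to an integer, and the rule of stopping before the budget is exceeded are assumptions made for illustration; they are not taken from [HK11] or from the lemmas above.

```python
import math

def epoch_schedule(T, C0, C1, kappa):
    """List (T_e, eta_e) for e = 1, 2, ... while the total number of steps stays <= T."""
    r = kappa / (2.0 * kappa - 2.0)
    schedule, used, e = [], 0, 1
    while True:
        T_e = math.ceil(C0 * 2 ** e)      # epoch length, rounded up (assumption)
        if used + T_e > T:                # stop before exceeding the budget (assumption)
            break
        schedule.append((T_e, C1 * 2.0 ** (-e * r)))
        used += T_e
        e += 1
    return schedule

# Illustrative values: kappa = 2 with C0 = 1 and C1 = 2 / L for L = 1.
for T_e, eta_e in epoch_schedule(T=1000, C0=1, C1=2.0, kappa=2.0):
    print(T_e, eta_e)
```

Since the epoch lengths double, the number of epochs grows only logarithmically in $T/C_0$, so a larger $C_0$ (as in the high-probability version) leaves fewer epochs for the same budget.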